May 8, 2019

Collaboration with NAU’s Pathogen and Microbiome Institute

Cluster Size Distributions

Defining \(\gamma\) = HAI rate from full data

  • For each cluster, the first time a strain is observed it is considered environmentally acquired.
  • The second (or third, or fourth, ...) time a strain is observed, it is considered healthcare acquired.

\[ \gamma = \frac{N - ||\mathcal{I}||}{N} = 1 - \frac{||\mathcal{I}||}{N}\] \[ N = \textrm{ Number of Patients }\] \[\mathcal{I} = \textrm{ Set of strain identifiers }\] \[||\mathcal{I}|| = \textrm{ Actual Number of Clusters/Strains }\]

  • Knowing \(||\mathcal{I}||\) is the key to calculating HAI rate!
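As a minimal sketch of the definition above, assuming the full data arrive as one strain identifier per patient (the function name `hai_rate` and the toy strain labels are illustrative, not from the original analysis):

```python
def hai_rate(strain_ids):
    """gamma = 1 - (number of distinct strains) / (number of patients)."""
    n_patients = len(strain_ids)
    if n_patients == 0:
        raise ValueError("need at least one patient")
    n_strains = len(set(strain_ids))  # ||I||: the first case of each strain is environmental
    return 1 - n_strains / n_patients

# Toy data: 6 patients, 3 strains, so 3 of the 6 cases are healthcare acquired.
strains = ["A", "A", "A", "B", "C", "C"]
print(hai_rate(strains))
```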

Observed Number of Clusters/Strains under Simple Random Sampling

  • Define the following

\[\alpha = \textrm{ proportion of the population sampled }\] \[ n_i = \textrm{ actual size of the }i \textrm{th cluster}\] \[ m_i = \textrm{ observed size of the }i \textrm{th cluster}\]

Notice that

\[1 \le m_i \le n_i\] and \[\sum n_i = N\] \[ \sum m_i = \alpha N\]

\[ \widehat{HAI}_{naive} = 1 - \frac{||I||}{n} \] \[ n = \textrm{ sample size } \]

Conditional Distribution

\[m_i | n_i \sim \textrm{ZTHyperGeometric}(n_i, \; N-n_i, \;\alpha N) \; \textrm{for}\; i \in I\]

  • Zero Truncated HyperGeometric
  • Assume approximate independence between observed cluster sizes
  • Distribution requires working with hypergeometric terms

\[f(0|n_i) = \frac{ {n_i \choose 0}{N-n_i \choose \alpha N} }{ {N \choose \alpha N}}\]

Notice that \(\alpha\) and \(f(0|n_i)\) are inversely related and we could crudely approximate

\[f(0|n_i) \approx 1-\alpha\]
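To see how crude the approximation is, the hypergeometric term can be evaluated directly. This is a small sketch; the population size and sampling fraction below are made-up illustration values, and `prob_cluster_unobserved` is a hypothetical helper name.

```python
from math import comb

def prob_cluster_unobserved(n_i, N, sample_size):
    """f(0 | n_i): probability that none of the n_i cluster members are sampled."""
    return comb(N - n_i, sample_size) / comb(N, sample_size)

N, alpha = 1000, 0.2                 # assumed values, for illustration only
sample_size = round(alpha * N)
for n_i in (1, 2, 5, 20):
    # Compare the exact term with the crude approximation 1 - alpha.
    print(n_i, prob_cluster_unobserved(n_i, N, sample_size), 1 - alpha)
```

For a cluster of size one the two agree exactly; for larger clusters the exact probability of missing the cluster entirely falls well below \(1-\alpha\).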

Critical Expectation

\[E[m_i] = E[ E(m_i|n_i)] = E[ (1-f(0|n_i))^{-1} \;\alpha \;n_i]\]

Using this equation, we can derive two different estimators.

  1. The plug-in estimator ignores the expectation and approximates \(\left[ 1-f(0)\right]^{-1} \approx \alpha^{-1}\). This results in \(\widehat{n}_i = m_i\).
  2. Still ignoring the expectation, we could use the actual hypergeometric function for \(f(0|n_i)\) and solve \(m_i = \left[1-f(0|\widehat{n}_i)\right]^{-1} \alpha \,\widehat{n}_i\) for \(\widehat{n}_i\). This equation must be solved numerically because of the binomial coefficients (the “chooses”) in \(f(0|n_i)\).
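One way the second estimator could be computed is by bisection. The sketch below is an assumption about the numerical method, not the solver used in the original work: it extends the binomial coefficients to real-valued cluster sizes through the log-gamma function, and all function names are hypothetical.

```python
from math import exp, lgamma

def log_choose(a, b):
    """log C(a, b), extended to real a via the gamma function."""
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)

def prob_zero(n, N, s):
    """f(0 | n): a cluster of size n is entirely missed by a sample of size s."""
    if n > N - s:
        return 0.0  # the cluster is too large to be missed
    return exp(log_choose(N - n, s) - log_choose(N, s))

def expected_observed(n, N, s):
    """Zero-truncated hypergeometric mean: alpha * n / (1 - f(0 | n))."""
    return (s / N) * n / (1.0 - prob_zero(n, N, s))

def solve_cluster_size(m_i, N, s, tol=1e-8):
    """Find n_hat with expected_observed(n_hat) = m_i by bisection on [m_i, N]."""
    if not 1 <= m_i <= s:
        raise ValueError("observed size must lie between 1 and the sample size")
    lo, hi = float(m_i), float(N)
    if expected_observed(lo, N, s) >= m_i:
        return lo
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_observed(mid, N, s) < m_i:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Illustration values only: population of 1000, sample of 200.
for m_i in (1, 2, 5):
    print(m_i, solve_cluster_size(m_i, N=1000, s=200))
```

The bracket works because the truncated mean never exceeds the cluster size at \(n = m_i\) and equals the sample size at \(n = N\).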

Biased Estimator

  • Denoting

\[\widehat{n} = \sum\widehat{n}_i\] \[ I = \textrm{ Set of observed strains }\] \[ ||I|| = \textrm{ Observed Number of Clusters/Strains }\] \[\widehat{\gamma}^* = \frac{1}{\widehat{n}}\sum_{i\in I} (\widehat{n}_i-1) = \frac{\widehat{n} - ||I||}{\widehat{n}} = 1-\frac{||I||}{\widehat{n}}\]

Does the plug-in Estimator Work?

Why doesn’t this work?

Bias Correction Procedure

  1. Calculate the sample HAI rate.
  2. Repeatedly subsample the sample at the designated \(\alpha\) fraction.
  3. For each subsample, calculate the subsample’s HAI rate
  4. Look at the average discrepancy and use that to adjust the sample HAI rate estimate.
  5. The adjustments are made on the logit scale to force the resulting rate to remain in the \([0,1]\) interval.

Bias Correction Procedure - Math!

By repeatedly sub-sampling at \(\alpha\) rate \(J\) times and calculating \(\widehat{\gamma}^*_j\) for the \(j\)th sub-sample,

\[\bar{\delta} = \frac{1}{J}\sum\left[ \textrm{logit}(\widehat\gamma^*) - \textrm{logit}(\widehat\gamma^*_j) \right]\]

\[\widehat{\gamma} = \textrm{ilogit}\left( \textrm{logit}( \widehat{\gamma}^* ) + \bar\delta \right)\]

We performed the bias correction step on the logit scale to ensure the resulting estimator is in \([0,1]\).
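The sub-sampling correction can be sketched as follows. This is an illustrative implementation under stated assumptions: the plug-in estimator \(\widehat{n}_i = m_i\) is used for \(\widehat\gamma^*\), rates are clipped away from 0 and 1 so the logit stays finite (the clipping is an addition here, not part of the slides), and the toy population and all names are hypothetical.

```python
import math
import random

def logit(p, eps=1e-9):
    p = min(max(p, eps), 1 - eps)  # assumption: clip to avoid infinite logits
    return math.log(p / (1 - p))

def ilogit(x):
    return 1 / (1 + math.exp(-x))

def plug_in_rate(strain_ids):
    return 1 - len(set(strain_ids)) / len(strain_ids)

def bias_corrected_rate(sample, alpha, n_subsamples=500, seed=0):
    """Adjust the sample rate by the mean logit gap between sample and sub-samples."""
    rng = random.Random(seed)
    k = max(1, round(alpha * len(sample)))        # sub-sample size at rate alpha
    gamma_star = plug_in_rate(sample)
    deltas = []
    for _ in range(n_subsamples):                 # J sub-samples
        sub = rng.sample(sample, k)
        deltas.append(logit(gamma_star) - logit(plug_in_rate(sub)))
    delta_bar = sum(deltas) / len(deltas)
    return ilogit(logit(gamma_star) + delta_bar)

# Toy population: 1000 strains with 5 cases each, sampled at alpha = 0.2.
rng = random.Random(1)
population = [strain for strain in range(1000) for _ in range(5)]
sample = rng.sample(population, k=1000)
gamma_star = plug_in_rate(sample)
gamma_hat = bias_corrected_rate(sample, alpha=0.2)
print(gamma_star, gamma_hat)
```

The idea is that sub-sampling the sample at rate \(\alpha\) mimics what sampling did to the population, so the average logit drop from sample to sub-sample is used as a stand-in for the drop from population to sample.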

Get approximate Confidence Intervals too!

  • The standard deviation of the \(\textrm{logit}(\widehat\gamma^*_j)\) values gives an estimated standard error for \(\textrm{logit}(\widehat{\gamma})\).
  • An approximate \(95\%\) confidence interval for \(\gamma\) is found by adding/subtracting on the logit scale and transforming back:

\[\textrm{ilogit} \left[ \textrm{logit}(\widehat\gamma) \pm Z_{0.975} \cdot SE(\textrm{logit}(\widehat\gamma))\right]\]
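The interval formula translates into a few lines of code. In this sketch the corrected estimate and the sub-sample rates are made-up numbers used only to exercise the function, and `logit_confidence_interval` is a hypothetical name.

```python
import math
from statistics import NormalDist, stdev

def logit(p):
    return math.log(p / (1 - p))

def ilogit(x):
    return 1 / (1 + math.exp(-x))

def logit_confidence_interval(gamma_hat, subsample_rates, level=0.95):
    """Interval built on the logit scale, then mapped back into (0, 1)."""
    se = stdev([logit(g) for g in subsample_rates])  # SE of logit(gamma_hat)
    z = NormalDist().inv_cdf(0.5 + level / 2)         # Z_{0.975} when level = 0.95
    center = logit(gamma_hat)
    return ilogit(center - z * se), ilogit(center + z * se)

# Hypothetical inputs: a corrected estimate and a handful of sub-sample rates.
lower, upper = logit_confidence_interval(0.75, [0.07, 0.08, 0.075, 0.09, 0.065])
print(lower, upper)
```

Because the end points pass through `ilogit`, they always stay inside the unit interval, which is the reason for working on the logit scale.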

Results

Plugin Results - Clinical Data

Hypergeometric Results - Clinical Data

Results - Simulated Populations

The Oxfordshire data could be reasonably modeled using a mixture of two distributions to separate the small cluster sizes from the large. We chose to model the small cluster sizes using a zero-truncated Poisson distribution. The large cluster sizes were modeled using a log-normal distribution.

\[n_i \sim \begin{cases} \textrm{TPoisson}(\lambda) & \textrm{ with probability } 1 - \rho \\ \textrm{logNormal}(\mu, \sigma) &\textrm{ with probability } \rho \end{cases}\]

for \(i\) in \(\mathcal{I}\).
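A population of cluster sizes from this mixture could be simulated along the following lines. The parameter values are placeholders rather than the fitted Oxfordshire values, and rounding the log-normal draws up to whole cluster sizes is an assumption of this sketch.

```python
import math
import random

def zero_truncated_poisson(rng, lam):
    """Rejection sampler: draw Poisson(lam) until the draw is nonzero."""
    limit = math.exp(-lam)
    while True:
        # Knuth's multiplication method for one Poisson draw (fine for small lam).
        k, prod = 0, rng.random()
        while prod > limit:
            k += 1
            prod *= rng.random()
        if k > 0:
            return k

def simulate_cluster_sizes(n_clusters, lam, mu, sigma, rho, seed=0):
    """Draw n_i from the mixture: log-normal with probability rho, else truncated Poisson."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(n_clusters):
        if rng.random() < rho:
            # Assumption: round the continuous log-normal draw up to an integer size.
            sizes.append(math.ceil(rng.lognormvariate(mu, sigma)))
        else:
            sizes.append(zero_truncated_poisson(rng, lam))
    return sizes

# Placeholder parameters, for illustration only.
sizes = simulate_cluster_sizes(1000, lam=1.5, mu=3.0, sigma=0.5, rho=0.05)
print(len(sizes), sum(sizes), 1 - len(sizes) / sum(sizes))  # clusters, N, gamma
```

The last printed value is the full-data HAI rate of the simulated population, which gives a known target for checking an estimator against.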

Simulated Data Populations

Simulated Data Populations: Results